Some basic information about the data:
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
First, I look at the distribution of quality ranks.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Looking at the distribution of wine quality across the observations, quality seems to be roughly normally distributed, with few observations in each tail.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: Stacking not well defined when ymin != 0
Transforming the y-axis to a log2 scale gives a better sense of the number of observations in the tails.
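To make the effect of the log2 scale concrete, the sketch below applies it to the quality counts from the frequency table above. It is only an illustration in base R, not the plotting code behind the figure, and the variable names are my own.

```r
# Quality counts copied from the frequency table above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# On a linear scale the tails are dwarfed by the mode:
# express each count as a percentage of all observations.
percent <- round(100 * quality_counts / sum(quality_counts), 2)

# log2 compresses the range, so the tail counts remain visible.
log2_counts <- round(log2(quality_counts), 2)

print(rbind(count = quality_counts, percent = percent, log2 = log2_counts))
```

The raw counts span more than two orders of magnitude, while their log2 values all lie within a range of about nine units, which is why the tails become readable after the transformation.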
Next, I get a sense of the distributions of the other variables.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect
It seems that many of the variables are somewhat normally distributed (although the binwidths have not yet been adjusted). Next, I adjust the binwidths very roughly in each plot, based on the scale of the x axis.
## Warning: position_stack requires constant width: output may be incorrect
After adjusting the binwidths, I’m intrigued by “residual sugar” and “alcohol”, which do not seem to be normally distributed. A few of the variables also seem to have very long tails.
Looking more closely at the distribution of residual sugar.
## Warning: position_stack requires constant width: output may be incorrect
Log transforming the y axis:
## Warning: Stacking not well defined when ymin != 0
## Warning: position_stack requires constant width: output may be incorrect
Looking more closely at the alcohol distribution. Adding breaks on the x axis shows whether the small “spikes” in the distribution coincide with round numbers (which could be caused, for example, by manufacturers reporting rounded rather than exact values).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
##
## 8 8.4 8.5 8.6
## 2 3 9 23
## 8.7 8.8 8.9 9
## 78 107 95 185
## 9.1 9.2 9.3 9.4
## 144 199 134 229
## 9.5 9.53333333333333 9.55 9.6
## 228 3 2 128
## 9.63333333333333 9.7 9.73333333333333 9.75
## 1 105 2 1
## 9.8 9.9 10 10.0333333333333
## 136 109 162 1
## 10.1 10.1333333333333 10.15 10.2
## 114 2 3 130
## 10.3 10.4 10.4666666666667 10.5
## 85 153 2 160
## 10.5333333333333 10.55 10.5666666666667 10.6
## 1 2 1 114
## 10.65 10.7 10.8 10.9
## 1 96 135 88
## 10.9333333333333 10.9666666666667 10.98 11
## 2 3 1 158
## 11.05 11.0666666666667 11.1 11.2
## 2 1 83 112
## 11.2666666666667 11.3 11.3333333333333 11.35
## 1 101 3 1
## 11.3666666666667 11.4 11.4333333333333 11.45
## 1 121 1 4
## 11.4666666666667 11.5 11.55 11.6
## 1 88 1 46
## 11.6333333333333 11.65 11.7 11.7333333333333
## 2 1 58 1
## 11.75 11.8 11.85 11.9
## 2 60 1 53
## 11.94 11.95 12 12.05
## 2 1 102 1
## 12.0666666666667 12.1 12.15 12.2
## 1 51 2 86
## 12.25 12.3 12.3333333333333 12.4
## 1 62 1 68
## 12.5 12.6 12.7 12.75
## 83 63 56 3
## 12.8 12.8933333333333 12.9 13
## 54 2 39 36
## 13.05 13.1 13.1333333333333 13.2
## 1 18 1 14
## 13.3 13.4 13.5 13.55
## 7 20 12 1
## 13.6 13.7 13.8 13.9
## 9 7 2 3
## 14 14.05 14.2
## 5 1 1
## Warning: position_stack requires constant width: output may be incorrect
The spikes in alcohol level do seem to coincide with round numbers: in the table above, nearly all observations fall on values with a single decimal place.
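One way to check this without reading the whole table is to flag which alcohol values sit on a 0.1 step. The sketch below does so for a short excerpt of the table above (the values from 9.5 to 9.75). The variable names and the tolerance of 1e-8 are assumptions of mine, not part of the original analysis.

```r
# Excerpt of the alcohol frequency table above (values 9.5 to 9.75).
alcohol_value <- c(9.5, 9.53333333333333, 9.55, 9.6,
                   9.63333333333333, 9.7, 9.73333333333333, 9.75)
n_wines <- c(228, 3, 2, 128, 1, 105, 2, 1)

# A value is "round" if it is a whole number of tenths.
# The tolerance guards against floating-point error (e.g. 9.7 * 10).
tenths <- alcohol_value * 10
is_tenth <- abs(tenths - round(tenths)) < 1e-8

# Share of observations in the excerpt that fall on a rounded value.
share_on_tenth <- sum(n_wines[is_tenth]) / sum(n_wines)

print(data.frame(alcohol_value, n_wines, is_tenth))
print(share_on_tenth)
```

Applied to the full alcohol column instead of this excerpt, the same flag would give the overall share of rounded values.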
I will look more closely into the relation between alcohol content and other variables in the bivariate and multivariate analysis.
The dataset contains 4898 observations of 12 features. Quality is an integer; all other features are numeric (floating-point) values.
Mean quality is 5.878 and median quality is 6. Although the quality scale runs from 0 to 10, the highest observed quality is 9 and the lowest is 3.
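These summary figures can be cross-checked against the frequency table of quality ranks shown earlier. The sketch below rebuilds the quality column from those counts in base R; the variable names are my own.

```r
# Quality counts copied from the frequency table of quality ranks.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Expand the counts back into one value per wine.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

mean_quality   <- mean(quality)
median_quality <- median(quality)
observed_range <- range(quality)

print(length(quality))
print(round(mean_quality, 3))
print(median_quality)
print(observed_range)
```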
Quality is the main feature of interest in this dataset.
I’m open-minded as to which other features will support my investigation into quality. I have no particular knowledge of wine chemistry, and at the beginning of the investigation, I do not have any intuition as to which variables correlate with higher quality rankings.
N/A
As mentioned above in connection with the univariate plots, most of the variables seemed to have roughly normal distributions, albeit with very long tails. Residual sugar and alcohol did not seem to be normally distributed.
I did not need to tidy or rearrange the data, which already has each variable as a column and each observation as a row. The only change I made was to remove the X column, since it merely numbers the observations and R already keeps track of row numbers.
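Dropping such an index column can be sketched as follows. The toy data frame holds only the first three rows of a few columns from the str() output above, plus an X column of row numbers as I assume it looked in the raw file. The name `raw` is hypothetical.

```r
# Toy stand-in for the raw file: first three rows of a few columns,
# plus the redundant X column (assumed to be a simple row counter).
raw <- data.frame(
  X = 1:3,
  fixed.acidity = c(7, 6.3, 8.1),
  alcohol = c(8.8, 9.5, 10.1),
  quality = c(6L, 6L, 6L)
)

# Keep every column except X; drop = FALSE keeps the result a data frame.
whitewine <- raw[, names(raw) != "X", drop = FALSE]

str(whitewine)
```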
I started my bivariate analysis by using ggpairs to get an overview of how the different variables relate to each other.
From the plot, I note that alcohol and density seem to have some degree of correlation with other variables, but other than that there does not seem to be much correlation between the variables.
In order to investigate which chemical properties are correlated with higher quality, I decide to group the wines by quality, using the summarise function to compute mean and median values for each “quality group”.
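The grouping step can be sketched in base R with tapply. This is a stand-in for the summarise call rather than the original code, and it uses only the alcohol values of the quality 3 and quality 9 wines, copied from the per-quality tables printed further down. All variable names are my own.

```r
# Alcohol values for quality 3 (value, count) and quality 9,
# copied from the per-quality frequency tables in this report.
alcohol_q3 <- rep(
  c(8, 8.5, 9.1, 9.4, 9.6, 9.7, 9.8, 10.1, 10.4, 10.5, 11, 11.5, 11.7, 12.4, 12.6),
  times = c(1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 4, 1, 1, 1, 1)
)
alcohol_q9 <- c(10.4, 12.4, 12.5, 12.7, 12.9)

wine_subset <- data.frame(
  quality = c(rep(3L, length(alcohol_q3)), rep(9L, length(alcohol_q9))),
  alcohol = c(alcohol_q3, alcohol_q9)
)

# One row per quality group with its mean, median and group size.
by_quality <- data.frame(
  quality        = sort(unique(wine_subset$quality)),
  mean_alcohol   = as.vector(tapply(wine_subset$alcohol, wine_subset$quality, mean)),
  median_alcohol = as.vector(tapply(wine_subset$alcohol, wine_subset$quality, median)),
  n              = as.vector(tapply(wine_subset$alcohol, wine_subset$quality, length))
)

print(by_quality)
```

With only five wines at quality 9, the single wine at 10.4 pulls the group mean (12.18) below the median (12.5), which is one reason to report both statistics.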
The main question is how the features of the data set relate to quality, so I am particularly interested in identifying features that vary with it.
From running ggpairs to produce a scatter matrix, I recall that alcohol had the highest correlation with quality, and I want to look into this in closer detail:
Alcohol mean values by quality:
Alcohol median values by quality:
Density also looks promising with regard to correlation and merits a closer look:
Plotting mean total.sulfur.dioxide content by quality.
Plotting median total.sulfur.dioxide content by quality.
Since the mean plots do not show variance, I decided to look at the relation between the different features and quality with boxplots, which give an indication of the distribution of each variable at each quality level. I therefore plot boxplots of all variables against quality using grid.arrange:
I want to look further into the relation between alcohol and quality. The below plot shows alcohol level by quality.
There seems to be a tendency for lower quality wines to have lower alcohol content and better quality wines to have higher alcohol content. That being said, there seems to be quite a bit of variance - for example, the lower quality wines seem to vary considerably with regard to alcohol content.
I want to investigate this further, and have a closer look at the relation between alcohol and quality by looking at a scatter plot, using jitter and transparency to avoid overplotting:
## Warning: Removed 78 rows containing missing values (geom_point).
## Warning: Removed 131 rows containing missing values (geom_point).
## whitewine$quality: 3
##
## 8 8.5 9.1 9.4 9.6 9.7 9.8 10.1 10.4 10.5 11 11.5 11.7 12.4 12.6
## 1 1 2 1 1 1 1 1 1 2 4 1 1 1 1
## --------------------------------------------------------
## whitewine$quality: 4
##
## 8.4 8.6 8.7 8.8 8.9 9 9.1 9.2 9.3 9.4 9.5 9.55 9.6 9.7 9.8
## 1 4 2 2 3 11 3 8 3 10 9 1 2 5 4
## 9.9 10 10.1 10.2 10.3 10.4 10.5 10.6 10.7 10.8 10.9 11 11.1 11.2 11.3
## 5 8 8 10 3 6 8 4 2 4 2 2 3 6 1
## 11.4 11.5 11.6 11.7 11.8 12 12.1 12.2 12.4 12.5 12.6 12.7 12.9 13.5
## 5 3 3 2 1 1 1 1 1 1 1 1 1 1
## --------------------------------------------------------
## whitewine$quality: 5
##
## 8 8.4 8.5 8.6
## 1 2 3 14
## 8.7 8.8 8.9 9
## 46 54 37 82
## 9.1 9.2 9.3 9.4
## 67 107 77 131
## 9.5 9.53333333333333 9.6 9.63333333333333
## 116 1 58 1
## 9.7 9.73333333333333 9.75 9.8
## 48 2 1 55
## 9.9 10 10.1 10.2
## 32 71 35 36
## 10.3 10.4 10.4666666666667 10.5
## 26 38 2 38
## 10.5333333333333 10.5666666666667 10.6 10.7
## 1 1 35 28
## 10.8 10.9 10.98 11
## 35 20 1 15
## 11.05 11.1 11.2 11.3
## 1 16 13 8
## 11.4 11.5 11.6 11.7
## 32 14 5 7
## 11.8 11.85 11.9 12
## 3 1 5 13
## 12.1 12.2 12.25 12.3
## 2 2 1 1
## 12.3333333333333 12.4 12.5 12.7
## 1 4 2 2
## 12.8 12.9 13 13.4
## 2 1 1 1
## 13.5 13.6
## 1 1
## --------------------------------------------------------
## whitewine$quality: 6
##
## 8.5 8.6 8.7 8.8
## 4 2 16 36
## 8.9 9 9.1 9.2
## 45 63 51 84
## 9.3 9.4 9.5 9.53333333333333
## 45 84 90 2
## 9.55 9.6 9.7 9.8
## 1 61 47 64
## 9.9 10 10.0333333333333 10.1
## 67 70 1 61
## 10.15 10.2 10.3 10.4
## 1 71 43 83
## 10.5 10.55 10.6 10.7
## 83 2 66 38
## 10.8 10.9 10.9333333333333 11
## 72 38 2 87
## 11.05 11.0666666666667 11.1 11.2
## 1 1 39 64
## 11.3 11.3333333333333 11.35 11.3666666666667
## 51 3 1 1
## 11.4 11.4333333333333 11.45 11.4666666666667
## 56 1 3 1
## 11.5 11.55 11.6 11.6333333333333
## 36 1 23 1
## 11.65 11.7 11.7333333333333 11.75
## 1 32 1 1
## 11.8 11.9 12 12.05
## 31 20 45 1
## 12.0666666666667 12.1 12.15 12.2
## 1 21 1 49
## 12.3 12.4 12.5 12.6
## 30 35 47 22
## 12.7 12.75 12.8 12.9
## 25 3 13 13
## 13 13.1 13.1333333333333 13.2
## 12 8 1 6
## 13.3 13.4 13.55 13.6
## 1 7 1 4
## 13.8 14
## 2 1
## --------------------------------------------------------
## whitewine$quality: 7
##
## 8.6 8.7 8.8 8.9
## 3 14 4 4
## 9 9.1 9.3 9.4
## 29 21 9 3
## 9.5 9.6 9.7 9.8
## 13 5 4 9
## 9.9 10 10.1 10.1333333333333
## 5 13 9 2
## 10.15 10.2 10.3 10.4
## 2 13 13 17
## 10.5 10.6 10.65 10.7
## 25 9 1 24
## 10.8 10.9 10.9666666666667 11
## 24 25 3 43
## 11.1 11.2 11.2666666666667 11.3
## 20 23 1 37
## 11.4 11.45 11.5 11.6
## 27 1 28 13
## 11.6333333333333 11.7 11.75 11.8
## 1 11 1 22
## 11.9 11.94 11.95 12
## 20 2 1 38
## 12.1 12.15 12.2 12.3
## 21 1 28 24
## 12.4 12.5 12.6 12.7
## 17 22 31 22
## 12.8 12.8933333333333 12.9 13
## 30 2 18 17
## 13.05 13.1 13.2 13.3
## 1 8 5 6
## 13.4 13.5 13.6 13.7
## 6 10 4 7
## 13.9 14 14.05 14.2
## 3 3 1 1
## --------------------------------------------------------
## whitewine$quality: 8
##
## 8.5 8.8 8.9 9.6 9.8 10.4 10.5 10.7 10.9 11 11.1 11.2 11.3 11.4 11.5
## 1 11 6 1 3 7 4 4 3 7 5 6 4 1 6
## 11.6 11.7 11.8 11.9 12 12.1 12.2 12.3 12.4 12.5 12.6 12.7 12.8 12.9 13
## 2 5 3 8 5 6 6 7 9 10 8 5 9 5 6
## 13.1 13.2 13.4 14
## 2 3 6 1
## --------------------------------------------------------
## whitewine$quality: 9
##
## 10.4 12.4 12.5 12.7 12.9
## 1 1 1 1 1
From the above plot, it appears that higher quality wines do indeed typically contain more alcohol. That being said, the jitter plot also reveals that there are observations across nearly the whole range of alcohol percentages at each quality level - indicating a great deal of variance.
## Warning: Removed 110 rows containing non-finite values (stat_boxplot).
As stated in the univariate plots section, I started my analysis of which variables matter for wine quality with an open mind. I therefore decided to plot every variable in a boxplot with quality on the x axis. For several of the features, there seems to be a polynomial (quadratic) relation between quality and the feature. This is the case with alcohol, for example, where the highest and lowest quality wines have higher alcohol content, and the medium-low quality wines have lower alcohol content on average.
There seem to be relations between alcohol and some other variables. In particular, there seems to be a relation between alcohol and quality (the feature of interest), with a correlation of 0.4355747.
## [1] 0.4355747
This trend can also be shown in a density plot:
Alcohol content, however, also seems to be related to other features, such as density (correlation of -0.7801376)…
## [1] -0.7801376
## Warning: Removed 3 rows containing missing values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
… and total.sulfur.dioxide (correlation of -0.4488921).
## [1] -0.4488921
The strongest relationship I found was the relationship between density and residual sugar. The correlation here is 0.8389665.
## [1] 0.8389665
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 3 rows containing missing values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
## $title
## [1] "Density by residual sugar"
##
## attr(,"class")
## [1] "labels"
Recalling that alcohol and density seemed to have a degree of correlation, I want to see how this relates to quality by adding color for quality:
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 3 rows containing missing values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
From the plot above it seems that wines of higher quality are typically higher in alcohol and lower in density.
The plot below shows the relation between total.sulfur.dioxide and density. From running ggpairs, I know they have one of the strongest correlations between the variables. By adding color for quality, I want to see if there is some relation to quality:
## Warning: Removed 3 rows containing missing values (geom_point).
It appears higher quality wines have lower density and lower total.sulfur.dioxide.
There also appears to be a correlation between density and residual.sugar level, and I want to see how this relates to quality:
## Warning: Removed 5 rows containing missing values (geom_point).
It appears that higher quality wines tend to have lower density and less residual sugar.
I also want to investigate the relation between pH values and quality a bit further:
Except for the very highest and the very lowest quality wines, mean pH across quality groups seems to be relatively similar.
However, the shape of the distribution seems to be slightly different, which is more visible if I use facet wrap to create a separate pH density plot for each level of quality:
Very low quality wines seem to vary a lot with regard to pH values, whereas the highest quality wines tend to have pH values that are more tightly clustered. I wonder whether there is some kind of relation here, or whether it is simply a result of there being few observations at the extreme ends of the quality spectrum.
Total.sulfur.dioxide and free.sulfur.dioxide seem to have some degree of correlation (0.615501), and I want to examine this in closer detail, in particular how this relates to quality:
## [1] 0.615501
## Warning: Removed 94 rows containing missing values (geom_point).
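The figure quoted above is presumably Pearson's r, which is what R's cor() computes by default. As a minimal sketch of what that number measures, the code below computes it by hand and compares it with cor(). It uses only the first ten rows printed in the str() output at the top of this report, so its result will not match the full-data value of 0.615501. The helper `pearson_r` is my own, not part of the original analysis.

```r
# First ten rows of the two sulfur dioxide columns, from the str() output.
free_so2  <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
total_so2 <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# Pearson's r: covariance of the centred values, scaled by both spreads.
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_r(free_so2, total_so2)
r_builtin <- cor(free_so2, total_so2)  # method = "pearson" is the default

print(c(manual = r_manual, builtin = r_builtin))
```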
It appears that higher quality wines have less total.sulfur.dioxide and more free.sulfur.dioxide. I’m at a loss as to why this might be the case, but a quick Google query reveals that sulfur dioxide (SO2) protects wine from oxidation and bacteria. However, too much of it can impact taste.
From this research I understand that free and total sulfur dioxide levels are related. This leaves me curious as to whether the proportion of free to total sulfur dioxide has an impact on quality.
I decide to create a new variable free.sulfur.dioxide.proportion which is free.sulfur.dioxide/total.sulfur.dioxide, and plot the results by quality:
## [1] -0.1747372
## [1] 0.008158067
## [1] 0.1972141
The proportion of free.sulfur.dioxide to total.sulfur.dioxide is indeed slightly more strongly correlated with quality than either component on its own, but with a correlation of 0.1972141, it is not a strong predictor of quality.
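The new variable described above can be sketched as follows, using the first ten rows of the two sulfur dioxide columns from the str() output at the top of this report. The data frame name `first_rows` is hypothetical; in the analysis itself the column would be added to the full data set.

```r
# First ten rows of the two sulfur dioxide columns, from the str() output.
first_rows <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# Share of the total sulfur dioxide that is present in free form.
first_rows$free.sulfur.dioxide.proportion <-
  first_rows$free.sulfur.dioxide / first_rows$total.sulfur.dioxide

print(round(first_rows$free.sulfur.dioxide.proportion, 3))
```

Because free sulfur dioxide is part of the total, the proportion should always lie between 0 and 1, which makes it easy to sanity-check.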
Since alcohol is the feature which in itself has the strongest relation with quality, I want to investigate the relation between alcohol and the free sulfur dioxide proportion and their relation to quality by adding color for quality:
From the above plot, it appears that higher quality wines have a higher level of alcohol, as well as a higher proportion of free sulfur dioxide.
In the multivariate analysis, I looked at the relation between alcohol and density, which seem to strengthen each other when looking at quality. Higher quality wines are typically higher in alcohol and lower in density. The features density and residual sugar also seemed to strengthen each other, with higher quality wines tending to have lower density and less residual sugar. This is also true for the relation between alcohol and the proportion of free sulfur dioxide: higher quality wines have a higher level of alcohol, as well as a higher proportion of free sulfur dioxide.
I found it interesting that the distribution of pH values seemed to be so different across different quality levels. Given the low number of observations at the extreme ends of the quality spectrum, however, it is hard to say whether this is a result of a genuine difference between high and low quality wines, or whether it is particular just to this sample of white wines.
N/A
As can be seen from this plot, wines of lower quality tend to have a lower alcohol content, while higher quality wines tend to have a higher alcohol content. This trend is most apparent at the quality levels with many observations (in the quality range 4-8). For wines with quality 9, the distribution appears to be bimodal, but there are only 5 observations at this level. The spike just after 10 % is due to a single wine ranked 9 with an alcohol percentage of 10.4; the other wines ranked 9 have an alcohol percentage between 12.4 and 12.9. For other quality levels, outliers such as this do not affect the plot to the same degree.
This plot shows the distribution of pH values across the groups of wines with the same quality. Interestingly the distribution seems to vary, particularly for the very good wines (quality 9) and the low quality wines (quality 3).
I have created a new feature, which is the proportion of free sulfur dioxide to total sulfur dioxide. From the above plot, it appears that higher quality wines have a higher level of alcohol, as well as a higher proportion of free sulfur dioxide, and vice versa for lower quality wines.
The white wine data set contains information on 4898 white variants of the Portuguese “Vinho Verde” wine. My overall goal with the analysis was to uncover relations between the different features and wine quality. From my analysis, it appears that there is a connection between some of the features and wine quality. Alcohol level in particular appears to be correlated with higher quality.
However, the relations I found were not as pronounced as the strong, linear relation between price and carat in the diamonds dataset. It was a bit disappointing not to uncover a stronger relationship. On the other hand, it would be surprising if something as complex as the subjective taste of wine could be broken down into 11 chemical properties. There are likely interactions between the chemical properties that together produce the subjective experience of the wine.
Some of the analysis might be influenced by the fact that there are very few observations at the extreme ends of quality. For example, there are only 20 observations of wines judged to be of quality 3, and only 5 of wines judged to be of quality 9, the highest observed.
The data set only covers wines from one region in Portugal. It would be interesting to investigate whether the findings would be different if wines from a different region, or a range of regions, were used. The data also seems to be limited to one year. It would be interesting to see year-on-year changes, particularly as one often hears wine producers talk about “good years” and “bad years”, and to see whether the chemical properties of the wine change from a good year to a bad year.